fix deleteat! and subset! performance #3249

bkamins · 2022-12-15T19:27:03Z

Explanation in https://discourse.julialang.org/t/learning-to-benchmark-and-find-the-best-function-to-select-a-subset-of-a-dataframe/91704/12.

Benchmarks

This PR

julia> using DataFrames, Random

julia> Random.seed!(1234)
TaskLocalRNG()

julia> df = DataFrame(rand(10^6, 100), :auto);

julia> df.id = rand(1:100, 10^6);

julia> inds = rand(Bool, 10^6);

julia> x = copy(df); @time deleteat!(x, inds);
  0.084989 seconds (29.05 k allocations: 1.471 MiB, 11.83% compilation time)

julia> x = copy(df); @time deleteat!(x, inds);
  0.074356 seconds (305 allocations: 4.766 KiB)

julia> x = copy(df); @time subset!(x, :x1 => Returns(inds));
  0.091337 seconds (485 allocations: 262.422 KiB)

julia> x = copy(df); @time subset!(x, :x1 => Returns(inds));
  0.148983 seconds (485 allocations: 262.422 KiB)

julia> x = groupby(copy(df), :id); @time subset!(x, :x1 => x -> rand(Bool, length(x)));
  0.207012 seconds (1.15 M allocations: 63.651 MiB, 31.11% compilation time)

julia> x = groupby(copy(df), :id); @time subset!(x, :x1 => x -> rand(Bool, length(x)));
  0.216079 seconds (1.15 M allocations: 63.632 MiB, 30.41% compilation time)

1.4.4 release

julia> using DataFrames, Random

julia> Random.seed!(1234)
TaskLocalRNG()

julia> df = DataFrame(rand(10^6, 100), :auto);

julia> df.id = rand(1:100, 10^6);

julia> inds = rand(Bool, 10^6);

julia> x = copy(df); @time deleteat!(x, inds);
  0.300327 seconds (307 allocations: 3.817 MiB)

julia> x = copy(df); @time deleteat!(x, inds);
  0.303013 seconds (307 allocations: 3.817 MiB)

julia> x = copy(df); @time subset!(x, :x1 => Returns(inds));
  0.307057 seconds (483 allocations: 4.074 MiB)

julia> x = copy(df); @time subset!(x, :x1 => Returns(inds));
  0.300665 seconds (483 allocations: 4.074 MiB)

julia> x = groupby(copy(df), :id); @time subset!(x, :x1 => x -> rand(Bool, length(x)));
  0.437468 seconds (1.15 M allocations: 67.427 MiB, 15.85% compilation time)

julia> x = groupby(copy(df), :id); @time subset!(x, :x1 => x -> rand(Bool, length(x)));
  0.417350 seconds (1.15 M allocations: 67.425 MiB, 15.70% compilation time)

nalimilan · 2022-12-17T17:53:44Z

Interesting. Have you checked with more columns and with a lower percentage of dropped rows? I would expect the findall approach to be less slow (and maybe faster) in these cases.

src/dataframe/dataframe.jl

bkamins · 2022-12-17T18:37:11Z

> I would expect the findall approach to be less slow (and maybe faster) in these cases.

You are right! 🧠

Empirically, the findall approach is better when fewer than 5% of observations are dropped (the threshold probably also depends on the number of columns, but I wanted something relatively simple). I have proposed an adaptive algorithm that switches between the two approaches as needed.
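
To make the idea concrete, here is a minimal sketch of such a switch written against plain Base vectors rather than a DataFrame. It is not the code in this PR: the name `adaptive_deleteat!`, the `threshold` keyword, and the representation of a table as a vector of columns are all assumptions made for illustration.

```julia
# Hypothetical helper, NOT the DataFrames.jl implementation.
# `cols` stands in for the columns of a table; `mask[i] == true` marks row i for deletion.
# `threshold` is the assumed fraction of dropped rows below which integer indices are used.
function adaptive_deleteat!(cols::AbstractVector{<:AbstractVector},
                            mask::AbstractVector{Bool};
                            threshold::Real=0.05)
    n = length(mask)
    all(col -> length(col) == n, cols) ||
        throw(DimensionMismatch("every column must have length(mask) rows"))

    ndrop = count(mask)
    # Few rows to drop: build the integer indices once and reuse them for every column.
    # Many rows to drop: pass the Bool mask straight to Base.deleteat!.
    drop = ndrop < threshold * n ? findall(mask) : mask

    for col in cols
        deleteat!(col, drop)
    end
    return cols
end

cols = [collect(1:10), collect(11:20)]
mask = [i in (2, 5) for i in 1:10]
adaptive_deleteat!(cols, mask)
```

Both branches remove the same rows; only the form in which the rows to delete are handed to `deleteat!` differs, which is the choice the threshold controls.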

nalimilan · 2022-12-17T20:35:37Z

Cool. Can you check when there are many columns? That's a use case that we care about.

bkamins · 2022-12-17T22:45:56Z

Here is an example: 100 columns, 10^6 rows. Tested with 5.5% of rows to drop (so a bit above the 5% threshold).

Setup:

df = DataFrame(rand(10^6, 100), :auto)

This PR:

julia> t = 0.055;

julia> Random.seed!(1234);

julia> idx = rand(10^6) .< t;

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.095975 seconds (303 allocations: 4.734 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.096654 seconds (303 allocations: 4.734 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.094744 seconds (303 allocations: 4.734 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.094374 seconds (303 allocations: 4.734 KiB)

Current release:

julia> t = 0.055;

julia> Random.seed!(1234);

julia> idx = rand(10^6) .< t;

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.092951 seconds (304 allocations: 432.203 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.087335 seconds (304 allocations: 432.203 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.091471 seconds (304 allocations: 432.203 KiB)

julia> dfc = copy(df); @time deleteat!(dfc, idx);
  0.092145 seconds (304 allocations: 432.203 KiB)

I did some more tests on even wider tables and it seems that a more precise threshold is 6% on my laptop, so I changed it to that value.
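
A rough way to look for the crossover point on a given machine is to sweep the fraction of dropped rows and time both strategies. The sketch below does this for plain Base vectors only, so its numbers are not comparable to the DataFrames.jl timings above; `time_delete` and its arguments are hypothetical names, and the first call of each variant includes compilation time.

```julia
using Random

# Hypothetical helper: time deleting the rows flagged by a random mask from
# `ncols` freshly generated vectors, using either integer indices (findall)
# or the Bool mask directly.
function time_delete(n, ncols, frac; use_findall::Bool, rng=Random.default_rng())
    mask = rand(rng, n) .< frac
    cols = [rand(rng, n) for _ in 1:ncols]
    return @elapsed begin
        drop = use_findall ? findall(mask) : mask
        foreach(col -> deleteat!(col, drop), cols)
    end
end

# Sweep a few fractions around the thresholds discussed in this thread.
for frac in (0.01, 0.05, 0.06, 0.5)
    t_idx = time_delete(10^5, 20, frac; use_findall=true)
    t_mask = time_delete(10^5, 20, frac; use_findall=false)
    println("frac = ", frac, ": findall ", t_idx, " s, mask ", t_mask, " s")
end
```

Single `@elapsed` measurements are noisy, so repeating each call several times, as in the transcripts above, gives a more reliable picture.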

nalimilan · 2022-12-18T10:36:20Z

OK, great!

bkamins · 2022-12-18T14:13:13Z

Thank you!

fix deleteat! and subset! performance

e602288

bkamins requested a review from nalimilan December 15, 2022 19:27

bkamins added the performance label Dec 15, 2022

bkamins added this to the patch milestone Dec 15, 2022

nalimilan reviewed Dec 17, 2022

implement adaptive algorithm for finding rows to delete

54f1f4f

fixes after code review

5a26d48

nalimilan approved these changes Dec 18, 2022

bkamins merged commit b240458 into main Dec 18, 2022

bkamins deleted the bk/deleteat branch December 18, 2022 14:13
